[ManagedIdentity] Detect dead KeyGuard keys and purge orphan IMDSv2 mTLS certs on reboot#6037

Closed
gladjohn wants to merge 1 commit into main from gladjohn/reboot-vm-proposal

Conversation

@gladjohn
Contributor

Summary

Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition on Azure VMs with KeyGuard.

On VM restart the per-boot KeyGuard key (NCryptUsePerBootKeyFlag) is reaped by VBS, but the persisted binding cert under CN=managedidentitysnissuer.login.microsoft.com in CurrentUser\My still references the old public key. Two failure modes today:

  1. CngKey.Open throws HR=0x8009003A (Cannot decrypt a VBS-isolated key.). This is handled by minting a fresh key, but the orphaned cert was left behind, so a subsequent cold path could still surface it before the reactive IsSchanelFailure catch in ImdsV2ManagedIdentitySource kicks in.
  2. Zombie-VBS handle: CngKey.Open succeeds because the container's public-key metadata survived the reboot, but the private material is dead. ExportParameters(false) still returns the old modulus, which matches the persisted cert, so a cert-side modulus comparison cannot detect this case. The SChannel handshake then fails and we fall through to the reactive catch.
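
The second failure mode is the reason a key-side check is needed: only an operation that exercises the private key can tell a live handle from a dead one. A minimal sketch of such a probe follows. The class and method shape are assumptions for illustration, not the PR's actual CanSign; it only mirrors the described behavior (a 1-byte RSA-SHA256 PKCS1 signature, treated as dead on failure).

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper for illustration; the real probe lives in
// WindowsCngKeyOperations and may differ in shape and error handling.
static class KeyLivenessProbe
{
    // Returns true only if the key can produce a signature, i.e. the
    // private material behind the handle is actually usable.
    public static bool CanSign(RSA rsa)
    {
        try
        {
            byte[] signature = rsa.SignData(
                new byte[] { 0x00 },            // 1-byte payload, content is irrelevant
                HashAlgorithmName.SHA256,
                RSASignaturePadding.Pkcs1);
            return signature.Length > 0;
        }
        catch (Exception ex) when (ex is CryptographicException
                                   or InvalidOperationException
                                   or NotSupportedException)
        {
            // Any signing failure is treated as "key is dead".
            return false;
        }
    }
}
```

As a rough analogy, a public-only RSA object behaves like the zombie handle: ExportParameters(false) returns a modulus identical to the live key's, yet the probe reports it as unusable because no private key is available to sign with.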

Changes

WindowsCngKeyOperations.cs

  • CanSign liveness probe right after CngKey.Open succeeds. 1-byte RSA-SHA256 PKCS1 sign; ~1–3 ms; runs once per process (the result is cached in WindowsManagedIdentityKeyProvider._cachedKey behind a SemaphoreSlim). Catches the zombie-handle variant cleanly.
  • PurgeManagedIdentityCertificates — one-shot, issuer-CN substring sweep (managedidentitysnissuer.login.microsoft.com, case-insensitive) of CurrentUser\My, invoked at the moment a fresh KeyGuard key is minted (both probe-failed and Open-threw branches).
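
The issuer-CN sweep described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the class name, the tuple return type, and the choice to have the caller open the store are all hypothetical. Only the issuer string and the case-insensitive substring match come from the description.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Hypothetical sketch of a best-effort issuer sweep.
static class IssuerSweepSketch
{
    public const string IssuerMarker = "managedidentitysnissuer.login.microsoft.com";

    // Case-insensitive substring match on the certificate's issuer string.
    public static bool IsManagedIdentityIssuer(string issuer) =>
        !string.IsNullOrEmpty(issuer) &&
        issuer.IndexOf(IssuerMarker, StringComparison.OrdinalIgnoreCase) >= 0;

    // Assumes the caller has already opened the store with OpenFlags.ReadWrite.
    public static (int Inspected, int Removed) Purge(X509Store store)
    {
        // Copy the certificates into a list first so removal never
        // mutates the collection being enumerated.
        var snapshot = new List<X509Certificate2>();
        foreach (X509Certificate2 cert in store.Certificates)
        {
            snapshot.Add(cert);
        }

        int removed = 0;
        foreach (X509Certificate2 cert in snapshot)
        {
            try
            {
                if (IsManagedIdentityIssuer(cert.Issuer))
                {
                    store.Remove(cert);
                    removed++;
                }
            }
            catch (CryptographicException)
            {
                // Best effort: a cert that cannot be removed is skipped.
            }
            finally
            {
                cert.Dispose();
            }
        }

        return (snapshot.Count, removed);
    }
}
```

A caller would open `new X509Store(StoreName.My, StoreLocation.CurrentUser)` with `OpenFlags.ReadWrite` and pass it to `Purge`. Keeping the match in a separate pure function makes the filter testable without touching a real store.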

WindowsCngKeyOperationsPurgeUnitTests.cs

Four Windows-only unit tests for the purge filter behavior:

  • RemovesCertWithMatchingIssuer
  • LeavesCertWithNonMatchingIssuer
  • MatchIsCaseInsensitive
  • OnlyRemovesMatching_LeavesOtherCertsAlone

Tests use CertificateRequest-based self-signed PFX with a discriminating Subject OU (MSAL-Purge-Test-<Guid>) and ImdsV2TestStoreCleaner.RemoveAllTestArtifacts() in [TestInitialize].
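
For readers unfamiliar with the approach, a self-signed test certificate with a discriminating OU could be built roughly like this. The helper name and validity window are assumptions; the actual tests also export to PFX and import into the store, which is omitted here.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Hypothetical test helper; not the code from the PR's test file.
static class TestCertSketch
{
    // For a self-signed cert the issuer equals the subject, so putting the
    // CN under test in the subject also controls what the purge filter sees.
    public static X509Certificate2 CreateSelfSigned(string cn, out string ou)
    {
        // A unique OU lets cleanup code find only certs created by the tests.
        ou = "MSAL-Purge-Test-" + Guid.NewGuid().ToString("N");

        using RSA rsa = RSA.Create(2048);
        var request = new CertificateRequest(
            $"CN={cn}, OU={ou}",
            rsa,
            HashAlgorithmName.SHA256,
            RSASignaturePadding.Pkcs1);

        return request.CreateSelfSigned(
            DateTimeOffset.UtcNow.AddMinutes(-5),
            DateTimeOffset.UtcNow.AddDays(1));
    }
}
```

The unique OU is what lets a cleanup step remove exactly the test artifacts and nothing else, even if an earlier run crashed before its own cleanup.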

Retained

The reactive SChannel catch in ImdsV2ManagedIdentitySource.AuthenticateAsync is kept as a defensive backstop.

Why purge at the mint site

When the container was just regenerated we know every cert under that issuer is orphaned by definition — no need to inspect them individually.

  • No per-Read discovery cost on cold cache after reboot.
  • Multi-identity hosts (SAMI + multiple UAMIs sharing the same KeyGuard container) are cleaned up uniformly in a single store-open.
  • Lower first-call latency post-reboot — the first request after restart hits a clean store and a single /issuecredential POST.
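
The control flow this implies can be sketched with injected delegates, so the ordering is visible without any KeyGuard specifics. All names here are hypothetical, and the sketch assumes purgeOrphans is best-effort and handles its own errors.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical sketch of "purge at the mint site"; not the PR's code.
static class MintSiteFlowSketch
{
    public static TKey GetOrCreate<TKey>(
        Func<TKey> open,            // e.g. wraps CngKey.Open
        Func<TKey, bool> canSign,   // liveness probe
        Func<TKey> createFresh,     // mints a new per-boot key
        Action purgeOrphans)        // issuer sweep of the cert store
        where TKey : class
    {
        TKey key = null;
        try
        {
            key = open();
        }
        catch (CryptographicException)
        {
            // Open-threw branch: fall through to minting.
        }

        if (key != null && canSign(key))
        {
            return key; // Live key: reuse it, leave the store untouched.
        }

        // Probe-failed branch: release the dead handle if it is disposable.
        (key as IDisposable)?.Dispose();

        // Both branches converge here: mint first, then purge, because every
        // cert bound to the previous key is now orphaned by definition.
        TKey fresh = createFresh();
        purgeOrphans();
        return fresh;
    }
}
```

Because the purge runs only on the two mint branches, the common path (a live key within the same boot) never opens the certificate store.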

Validation

Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and mixed SAMI/UAMI cases. Canonical post-reboot first call:

CngKey.Open threw CryptographicException HR=0x8009003A
Fresh KeyGuard key created
PurgeManagedIdentityCertificates: removed cert. Thumbprint=D3AE2783… (Inspected=4, Removed=1)
MAA attestation OK
POST /issuecredential -> 200
mTLS handshake -> 200 on first try (no reactive catch invoked)
Total ~2.8s on cold start
x5t#S256: 11Iz2a_ZOO0rhl2NzCBOVW75ul5Bg1M24rFKY8q1kEA
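
For reference, an x5t#S256 value is conventionally the base64url-encoded SHA-256 digest of the certificate's DER bytes, without padding. The sketch below assumes the logged value follows that standard definition; the helper name is hypothetical.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper showing the standard x5t#S256 computation.
static class ThumbprintSketch
{
    public static string X5tS256(byte[] derCertificate)
    {
        byte[] hash = SHA256.HashData(derCertificate);

        // base64url: drop '=' padding, map '+' to '-' and '/' to '_'.
        return Convert.ToBase64String(hash)
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');
    }
}
```

For an X509Certificate2 the input would be `cert.RawData`. A SHA-256 digest always encodes to 43 characters, which matches the length of the value in the log above.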

Full unit suite green on net8.0: 2069 passed, 0 failed, 19 skipped.

Refs

Draft status

Opening as draft to gather feedback alongside #6020 before deciding the final shape. Happy to:

…TLS certs on reboot

Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition.
On Azure VM restart the per-boot KeyGuard key (NCryptUsePerBootKeyFlag) is
reaped by VBS, but the persisted binding cert under
CN=managedidentitysnissuer.login.microsoft.com still references the old
public key. The next call then either burns a failed TLS handshake before
the reactive SChannel catch kicks in, or — in the zombie-handle variant —
falls through entirely because the cert's modulus still matches the dead
container.

Changes
-------
- Add CanSign liveness probe right after CngKey.Open in
  WindowsCngKeyOperations.TryGetOrCreateKeyGuard. 1-byte RSA-SHA256 PKCS1
  sign; ~1-3ms, runs once per process (result is cached in
  WindowsManagedIdentityKeyProvider._cachedKey). Catches zombie-VBS state
  where Open succeeds but private material is dead.

- Add PurgeManagedIdentityCertificates: one-shot issuer-CN substring sweep
  of CurrentUser\My, invoked at the moment a fresh KeyGuard key is minted
  (both the probe-failed path and the Open-threw path). Removes orphaned
  binding certs at the cause site so the next request doesn't pay any
  per-Read discovery cost and multi-identity hosts (SAMI + UAMIs sharing
  the KeyGuard container) are cleaned up uniformly.

- Add 4 Windows-only unit tests for the purge filter behavior (matching,
  non-matching, case-insensitive, only-removes-matching).

The reactive SChannel catch in ImdsV2ManagedIdentitySource is retained as
a defensive backstop.

Validation
----------
Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and
mixed SAMI/UAMI cases. Canonical post-reboot first call:
  - CngKey.Open threw CryptographicException HR=0x8009003A
  - Fresh KeyGuard key created
  - PurgeManagedIdentityCertificates removed orphan cert (Inspected=4)
  - MAA attestation OK
  - POST /issuecredential -> 200
  - mTLS handshake -> 200 on first try (no reactive catch invoked)
  - Total ~2.8s on cold start

Full unit suite green on net8.0: 2069 passed, 0 failed, 19 skipped.

Refs #6031.
Complementary to #6020 (cert-side modulus comparison): this PR adds the
key-side liveness probe and broad issuer-CN sweep at the mint site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the post-reboot recovery path for IMDSv2 mTLS PoP on Windows KeyGuard VMs by proactively detecting stale per-boot KeyGuard keys and purging orphaned IMDSv2 binding certificates from the user cert store when a fresh key is minted.

Changes:

  • Add an RSA signing liveness probe immediately after CngKey.Open to detect “zombie” per-boot KeyGuard handles and recreate the key when necessary.
  • Add PurgeManagedIdentityCertificates to remove IMDSv2-issued binding certs from CurrentUser\My when the KeyGuard key is re-minted.
  • Add Windows-only unit tests validating the purge issuer-filter behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/client/Microsoft.Identity.Client/ManagedIdentity/KeyProviders/WindowsCngKeyOperations.cs | Adds KeyGuard liveness probe and best-effort certificate-store purge for IMDSv2 issuer certs. |
| tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/WindowsCngKeyOperationsPurgeUnitTests.cs | Adds Windows-only unit tests for purge matching and non-matching issuer behavior. |

Comment on lines +439 to +457
// Snapshot to avoid 'collection modified during enumeration' provider quirks.
var snapshot = new X509Certificate2[store.Certificates.Count];
try
{
    store.Certificates.CopyTo(snapshot, 0);
}
catch (Exception copyEx)
{
    logger?.Info(() =>
        $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: store snapshot via CopyTo failed " +
        $"({copyEx.GetType().Name}: {copyEx.Message}). Falling back to enumeration.");

    int i = 0;
    snapshot = new X509Certificate2[store.Certificates.Count];
    foreach (X509Certificate2 c in store.Certificates)
    {
        snapshot[i++] = c;
    }
}
Comment on lines +474 to +487
try
{
    store.Remove(candidate);
    removed++;
    logger?.Info(() =>
        $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: removed cert. " +
        $"Thumbprint={thumb}, NotAfter={notAfter:O}, Issuer='{issuer}'.");
}
catch (Exception removeEx)
{
    logger?.Info(() =>
        $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: failed to remove cert " +
        $"Thumbprint={thumb}. {removeEx.GetType().Name}: '{removeEx.Message}'.");
}
@gladjohn gladjohn closed this May 27, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] IMDSv2 keys misbehave on restart

2 participants